InterPepScore: a deep learning score for improving the FlexPepDock refinement protocol

Abstract

Motivation: Interactions between peptide fragments and protein receptors are vital to cell function yet difficult to determine experimentally in structural detail. As such, many computational methods have been developed to aid in peptide–protein docking or structure prediction. One such method is Rosetta FlexPepDock, which consistently refines coarse peptide–protein models to sub-Ångström precision using Monte-Carlo simulations and statistical potentials. Deep learning has recently seen increased use in protein structure prediction, with graph neural networks used for protein model quality assessment.

Results: Here, we introduce a graph neural network, InterPepScore, as an additional scoring term to complement and improve the Rosetta FlexPepDock refinement protocol. InterPepScore is trained on simulation trajectories from FlexPepDock refinement starting from thousands of peptide–protein complexes generated by a wide variety of docking schemes. The addition of InterPepScore to the refinement protocol consistently improves the quality of the models created. On an independent benchmark of 109 peptide–protein complexes, its inclusion increases the fraction of complexes for which the top-scoring model has a DockQ-score of 0.49 (Medium quality) or better from 14.8% to 26.1%.

Availability and implementation: InterPepScore is available online at http://wallnerlab.org/InterPepScore.

Supplementary information: Supplementary data are available at Bioinformatics online.


Sequence Embeddings
Embeddings of protein sequences can achieve stronger performance than simply using BLOSUM62 columns, given enough training data (ElAbd et al., 2020). However, the number of unique sequences in the training set of this study was too low to train a reasonably generalizable embedder. There exist several pretrained embedders of protein sequences which have previously been used to great effect in protein structure prediction tasks through transfer learning, such as the embedding of Bepler and Berger (2019) or ProtBert (Elnaggar et al., 2020). Different versions of InterPepScore were trained using the outputs from these embedders as vertex features rather than the BLOSUM62 matrix column. Although using these features did lead to better performance on the training data, overfitting occurred earlier, resulting in higher validation loss compared to BLOSUM62, even with highly aggressive regularization (Table S1).

FlexPepDock refinement
Figure S1: Hexagon density plot of refinement results. For every single run of FlexPepDock for every starting position (not only the top runs), the DockQ-score of the decoy produced is compared to that of its starting position. The density color gradient is log-scaled.

Extended Analysis
FlexPepDock, with or without InterPepScore, frequently generates decoys with worse DockQ-scores than their starting positions when all generated decoys are considered, and not only the top decoys selected by score (Figure S1). This is expected, as FlexPepDock is a Monte-Carlo based approach which often ends up in local minima or unfavorable positions, requiring the protocol to be run many times and relying on the scoring function to select structures similar to the native structure. Also visible in the same figure is that FlexPepDock refinement with InterPepScore produces structures with a higher average absolute difference in DockQ relative to their starting structures than FlexPepDock without InterPepScore does: 0.059 on average compared to 0.053.

Since InterPepScore evaluates on the same scale independent of protein complex specifics, it also correlates better with the final DockQ of the peptide complex (correlation R: 0.394) than the reweighted score of FlexPepDock normalized per target does (correlation R: 0.184). Figure 5 of the main paper shows the correlation between the DockQ of the final top 1 selected models per complex generated by FlexPepDock with and without InterPepScore. The correlation in DockQ between the best models among the top 10 generated per starting position is shown in Figure S2. The largest improvement in DockQ-score occurs when the DockQ for InterPepScore is around or somewhat above 0.5.

Figure S3: [...] with FlexPepDock with and without InterPepScore, for different numbers of models generated. At each number of models generated, a different random selection of that many models out of the total 20 000 generated was evaluated (per complex). Interface Score (I_sc) was used, as it showed slightly better performance than reweighted score (reweighted_sc) when selecting top models from larger sets.
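The score comparison in this section, a score on one global scale against a score normalized per target, can be illustrated with a minimal sketch. The source does not state how the reweighted score was normalized, so the per-target z-score used here is an assumption, and the function names `pearson_r` and `zscore_per_target` are hypothetical helpers rather than part of InterPepScore or Rosetta.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def zscore_per_target(targets, scores):
    """Normalize scores within each target (ASSUMPTION: z-score normalization).

    targets[i] is the complex identifier of model i, scores[i] its raw score.
    """
    by_target = defaultdict(list)
    for target, score in zip(targets, scores):
        by_target[target].append(score)
    stats = {t: (mean(v), pstdev(v)) for t, v in by_target.items()}
    normalized = []
    for target, score in zip(targets, scores):
        mu, sd = stats[target]
        # A target whose models all share one score carries no ranking signal.
        normalized.append((score - mu) / sd if sd > 0 else 0.0)
    return normalized
```

Under these assumptions, `pearson_r(zscore_per_target(targets, reweighted), dockq)` would be compared with `pearson_r(interpepscore, dockq)`, where all four lists hold one entry per model.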
Figure S4: Fraction of complexes where the best of the top 10 models has a DockQ value over the thresholds of 0.23, 0.49, and 0.80, corresponding to acceptable (whole color), medium (striped), and high quality (crossed), respectively. The left-aligned columns with lighter coloring denote FlexPepDock with InterPepScore, while the darker, right-aligned columns denote FlexPepDock without InterPepScore.
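The quantity plotted in Figure S4 can be sketched as follows. This is an illustration only: the data layout (a dict mapping each complex to a list of `(score, dockq)` pairs), the function names, and the convention that a lower score is better are assumptions. The thresholds are those given in the caption, applied as "or better" in line with the wording of the abstract.

```python
# DockQ quality thresholds as stated in the Figure S4 caption.
THRESHOLDS = {"acceptable": 0.23, "medium": 0.49, "high": 0.80}


def best_of_top_n(models, n=10):
    """Best DockQ among the n best-scoring models of one complex.

    models: list of (score, dockq) pairs; ASSUMPTION: lower score is better.
    """
    top = sorted(models, key=lambda m: m[0])[:n]
    return max(dockq for _, dockq in top)


def fraction_passing(complexes, n=10):
    """Fraction of complexes whose best-of-top-n model reaches each threshold.

    complexes: dict mapping a complex identifier to its list of models.
    """
    best = [best_of_top_n(models, n) for models in complexes.values()]
    return {
        name: sum(b >= cutoff for b in best) / len(best)
        for name, cutoff in THRESHOLDS.items()
    }
```

Calling `fraction_passing` once on the models refined with InterPepScore and once on those refined without it would give the two sets of columns described in the caption.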

Increased Sampling for 15 targets
For 15 randomly selected targets out of the 109 in the larger test set, FlexPepDock with and without InterPepScore was run 20 000 times to analyze differences in runtime and the point at which results converge. The distributions of DockQ over different numbers of models generated can be found in Figure S3.
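A convergence analysis of this kind can be sketched as below: for each number of models, a fresh random subset of the generated models is drawn and the top-scoring model in that subset is selected. This is a minimal sketch and not the authors' analysis script; the function name, the `(score, dockq)` data layout, the fixed seed, and the assumption that a lower score is better are all illustrative choices.

```python
import random


def top1_dockq_at_sample_sizes(models, sample_sizes, seed=0):
    """DockQ of the top-scoring model in random subsets of increasing size.

    models: list of (score, dockq) pairs for one complex.
    sample_sizes: subset sizes to evaluate; each must not exceed len(models).
    ASSUMPTION: a lower score is better.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    result = {}
    for n in sample_sizes:
        subset = rng.sample(models, n)  # a new random draw for every size
        best = min(subset, key=lambda m: m[0])
        result[n] = best[1]
    return result
```

Repeating this per complex for sizes up to the full set of generated models gives the DockQ values whose distribution can then be compared across subset sizes.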

Length of Peptide
The addition of InterPepScore seems to have a larger positive influence on complexes with longer peptides (Figure S4). In this figure, it is evident that longer peptides are more difficult to model for regular FlexPepDock refinement, and that peptides of around length 15-20 see the largest improvement from the addition of InterPepScore.

FlexPepDock refinement with InterPepScore of AlphaFold2 models
To test the applicability of FlexPepDock refinement with InterPepScore to models generated by state-of-the-art simultaneous folding and docking protocols, it was run on models generated by AlphaFold-Multimer-v1 (Evans et al., 2021), run without access to structural templates. The results can be found in Figure 4 of the main paper. Only the top model, as ranked by AlphaFold's ranking score, was used for each complex. Figure S5 shows the average differences in comparison to the starting positions.

A similar test was also run using the docking approach based on AlphaFold2 for monomers proposed by Tsaban et al. (2021). This approach uses a polyglycine linker between the receptor and the peptide so that the complex can be submitted to the AlphaFold2 inference step as one single protein chain. As can be seen in Figure S6, FlexPepDock refinement with InterPepScore improves the quality of these models, much as it does for the true AlphaFold-Multimer structures (Figure 5 of the main paper).

Nr worsened: 5
Figure S6: The capacity for FlexPepDock refinement including the InterPepScore score term to improve the quality of models created by using AlphaFold2 with a polyglycine linker as proposed in Tsaban et al. (2021). Points outside the shaded area are significantly (>2 standard deviations from 0) changed; positive differences are improved relative to the starting pose.
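The single-chain input used in the polyglycine-linker approach described in this section can be sketched as a simple sequence concatenation. The linker length is not given in this text, so the default of 30 residues is a placeholder assumption, and `build_linked_sequence` is a hypothetical helper rather than part of AlphaFold2 or of the protocol of Tsaban et al. (2021).

```python
def build_linked_sequence(receptor_seq, peptide_seq, linker_length=30):
    """Join receptor and peptide into one chain with a polyglycine linker.

    linker_length is an ASSUMED placeholder; the source does not state it.
    Returns the linked sequence plus half-open (start, end) index ranges of
    the receptor and the peptide within it, so that the linker residues can
    be removed from the predicted model afterwards.
    """
    if linker_length < 1:
        raise ValueError("linker_length must be at least 1")
    linker = "G" * linker_length
    linked = receptor_seq + linker + peptide_seq
    receptor_range = (0, len(receptor_seq))
    peptide_range = (len(receptor_seq) + linker_length, len(linked))
    return linked, receptor_range, peptide_range
```

The returned index ranges allow the receptor and peptide residues to be extracted as separate chains before the model is passed on to refinement.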